Abstract. The CSA-ES is an Evolution Strategy with Cumulative Step-size Adaptation,
where the step-size is adapted by measuring the length of a so-called cumulative path.
The cumulative path is a combination of the previous steps realized by the algorithm,
where the importance of each step decreases with time. This article studies the CSA-ES
on composites of strictly increasing functions with affine linear functions through the
investigation of its underlying Markov chains. Rigorous results on the change and the
variation of the step-size are derived with and without cumulation. The step-size
diverges geometrically fast in most cases. Furthermore, the influence of the cumulation
parameter is studied.

Keywords: CSA, cumulative path, evolution path, evolution strategies, step-size adaptation

1 Introduction 

Evolution strategies (ESs) are continuous stochastic optimization algorithms searching
for the minimum of a real valued function $f : \mathbb{R}^n \to \mathbb{R}$. In the
$(1,\lambda)$-ES, in each iteration, $\lambda$ new children are generated from a single
parent point $X \in \mathbb{R}^n$ by adding a random Gaussian vector to the parent,

$$X \in \mathbb{R}^n \mapsto X + \sigma \mathcal{N}(0, C) .$$

Here, $\sigma \in \mathbb{R}_+^*$ is called step-size and $C$ is a covariance matrix. The
best of the $\lambda$ children, i.e. the one with the lowest $f$-value, becomes the parent
of the next iteration. To achieve reasonably fast convergence, step-size and covariance
matrix have to be adapted throughout the iterations of the algorithm. In this paper, $C$
is the identity and we investigate the so-called Cumulative Step-size Adaptation (CSA),
which is used to adapt the step-size in the Covariance Matrix Adaptation Evolution
Strategy (CMA-ES) [13,10]. In CSA, a cumulative path is introduced, which is a combination
of all steps the algorithm has made, where the importance of a step decreases
exponentially with time. Arnold and Beyer studied the behavior of CSA on sphere, cigar and
ridge functions [1,2,3,7] and on dynamical optimization problems where the optimum moves
randomly [5] or linearly [ ]. Arnold also studied the behaviour of a $(1,\lambda)$-ES on
linear functions with linear constraint [ ].

In this paper, we study the behaviour of the $(1,\lambda)$-CSA-ES on composites of strictly
increasing functions with affine linear functions, e.g. $f : x \mapsto \exp(x_2 - 2)$.
Because the CSA-ES is invariant under translation, under change of an orthonormal basis
(rotation and reflection), and under strictly increasing transformations of the $f$-value,
we investigate, w.l.o.g., $f : x \mapsto x_1$. Linear functions model the situation when
the current parent is far (here infinitely far) from the optimum of a smooth function. To
be far from the optimum means that the distance to the optimum is large, relative to the
step-size $\sigma$. This situation is undesirable and threatens premature convergence. The
situation should be handled well, by increasing step widths, by any search algorithm (and
is not handled well by the $(1,2)$-$\sigma$SA-ES [ ]). Solving linear functions is also
very useful to prove convergence independently of the initial state on more general
function classes.

In Section 2 we introduce the $(1,\lambda)$-CSA-ES, and some of its characteristics on
linear functions. In Sections 3 and 4 we study $\ln(\sigma_t)$ without and with cumulation,
respectively. Section 5 presents an analysis of the variance of the logarithm of the
step-size and in Section 6 we summarize our results.

Notations In this paper, we denote $t$ the iteration or time index, $n$ the search space
dimension, $\mathcal{N}(0,1)$ a standard normal distribution, i.e. a normal distribution
with mean zero and standard deviation 1. The multivariate normal distribution with mean
vector zero and covariance matrix identity will be denoted $\mathcal{N}(0, I_n)$, the
$i$-th order statistic of $\lambda$ standard normal distributions $\mathcal{N}_{i:\lambda}$,
and $\Psi_{i:\lambda}$ its distribution. If $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ is a
vector, then $[x]_i$ will be its value on the $i$-th dimension, that is $[x]_i = x_i$. A
random variable $X$ distributed according to a law $\mathcal{L}$ will be denoted
$X \sim \mathcal{L}$. If $A$ is a subset of $\mathcal{X}$, we will denote $A^c$ its
complement in $\mathcal{X}$.

2 The $(1,\lambda)$-CSA-ES

We denote with $X_t$ the parent at the $t$-th iteration. From the parent point $X_t$,
$\lambda$ children are generated: $Y_{t,i} = X_t + \sigma_t \xi_{t,i}$ with
$i \in [[1,\lambda]]$, and $\xi_{t,i} \sim \mathcal{N}(0, I_n)$,
$(\xi_{t,i})_{i \in [[1,\lambda]]}$ i.i.d. Due to the $(1,\lambda)$ selection scheme, from
these children, the one minimizing the function $f$ is selected:
$X_{t+1} = \mathrm{argmin}\{f(Y),\ Y \in \{Y_{t,1}, \ldots, Y_{t,\lambda}\}\}$. This latter
equation implicitly defines the random variable $\xi_t^\star$ as

$$X_{t+1} = X_t + \sigma_t \xi_t^\star . \qquad (1)$$
In order to adapt the step-size, the cumulative path is defined as

$$p_{t+1} = (1-c)\,p_t + \sqrt{c(2-c)}\;\xi_t^\star \qquad (2)$$

with $0 < c \le 1$. The constant $1/c$ represents the life span of the information
contained in $p_t$, as after $1/c$ generations $p_t$ is multiplied by a factor that
approaches $1/e \approx 0.37$ for $c \to 0$ from below (indeed
$(1-c)^{1/c} \le \exp(-1)$). The typical value for $c$ is between $1/\sqrt{n}$ and $1/n$.
We will consider that $p_0 \sim \mathcal{N}(0, I_n)$ as it makes the algorithm easier to
analyze.

The normalization constant $\sqrt{c(2-c)}$ in front of $\xi_t^\star$ in Eq. (2) is chosen
so that under random selection and if $p_t$ is distributed according to
$\mathcal{N}(0, I_n)$ then also $p_{t+1}$ follows $\mathcal{N}(0, I_n)$. Hence the length
of the path can be compared to the expected length of $\|\mathcal{N}(0, I_n)\|$
representing the expected length under random selection.
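The effect of the normalization constant can be checked on the variance of a single path coordinate. The following is a minimal sketch, assuming random selection (the step is a standard normal variable independent of the path); the function names are illustrative and not part of the paper.

```python
def path_variance_step(v, c):
    """Variance of one path coordinate after p <- (1-c) p + sqrt(c(2-c)) xi,
    assuming xi ~ N(0,1) is independent of p (random selection)."""
    return (1.0 - c) ** 2 * v + c * (2.0 - c)


def iterate_variance(v0, c, steps):
    """Apply the variance recursion `steps` times starting from v0."""
    v = v0
    for _ in range(steps):
        v = path_variance_step(v, c)
    return v


if __name__ == "__main__":
    for c in (1.0, 0.5, 0.1):
        # Variance 1 is a fixed point, and other starting values are pulled towards it.
        print(c, path_variance_step(1.0, c), iterate_variance(4.0, c, 200))
```

Since $(1-c)^2 + c(2-c) = 1$, a unit variance is preserved exactly, which is the property the constant is chosen for.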



The step-size update rule increases the step-size if the length of the path is larger
than the length under random selection and decreases it if the length is shorter than
under random selection:

$$\sigma_{t+1} = \sigma_t \exp\left(\frac{c}{d_\sigma}\left(\frac{\|p_{t+1}\|}{E\left(\|\mathcal{N}(0, I_n)\|\right)} - 1\right)\right)$$

where the damping parameter $d_\sigma$ determines how much the step-size can change and
is set to $d_\sigma = 1$. A simplification of the update considers the squared length of
the path [5]:

$$\sigma_{t+1} = \sigma_t \exp\left(\frac{c}{2d_\sigma}\left(\frac{\|p_{t+1}\|^2}{n} - 1\right)\right) . \qquad (3)$$

This rule is easier to analyse and we will use it throughout the paper. We will denote
$\eta_t^\star$ the random variable for the step-size change, i.e.
$\eta_t^\star = \exp(c/(2d_\sigma)(\|p_{t+1}\|^2/n - 1))$, and for $u \in \mathbb{R}^n$,
$\eta^\star(u) = \exp(c/(2d_\sigma)(\|u\|^2/n - 1))$.
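One iteration of the algorithm described in this section can be sketched as follows. This is a hedged illustration on $f(x) = [x]_1$ using the squared-length update; the function name `csa_es_step`, its argument order and the returned tuple are assumptions made for this example only.

```python
import math
import random


def csa_es_step(x, sigma, p, lam, c, d_sigma, rng):
    """One iteration of a (1, lambda)-CSA-ES minimizing f(x) = x[0],
    with the squared-length step-size update."""
    n = len(x)
    # Sample lambda candidate steps xi_{t,i} ~ N(0, I_n).
    steps = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(lam)]
    # (1, lambda) selection: keep the step whose child has the lowest f-value.
    xi_star = min(steps, key=lambda xi: x[0] + sigma * xi[0])
    x_new = [xj + sigma * s for xj, s in zip(x, xi_star)]
    # Cumulative path update with normalization sqrt(c (2 - c)).
    a = math.sqrt(c * (2.0 - c))
    p_new = [(1.0 - c) * pj + a * s for pj, s in zip(p, xi_star)]
    # Step-size update based on the squared path length.
    sq_norm = sum(v * v for v in p_new)
    sigma_new = sigma * math.exp(c / (2.0 * d_sigma) * (sq_norm / n - 1.0))
    return x_new, sigma_new, p_new, xi_star


if __name__ == "__main__":
    rng = random.Random(1)
    x, sigma, p = [0.0] * 5, 1.0, [rng.gauss(0.0, 1.0) for _ in range(5)]
    for _ in range(3):
        x, sigma, p, _ = csa_es_step(x, sigma, p, lam=6, c=0.5, d_sigma=1.0, rng=rng)
        print(x[0], sigma)
```

With `c = 1.0` the path reduces to the selected step, which is the case without cumulation.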

Preliminary results on linear functions. Selection on the linear function,
$f(x) = [x]_1$, is determined by
$[X_t]_1 + \sigma_t[\xi_t^\star]_1 \le [X_t]_1 + \sigma_t[\xi_{t,i}]_1$ for all $i$, which
is equivalent to $[\xi_t^\star]_1 \le [\xi_{t,i}]_1$ for all $i$, where by definition
$[\xi_{t,i}]_1$ is distributed according to $\mathcal{N}(0,1)$. Therefore the first
coordinate of the selected step is distributed according to $\mathcal{N}_{1:\lambda}$ and
all other coordinates are distributed according to $\mathcal{N}(0,1)$, i.e. selection does
not bias the distribution along the coordinates $2, \ldots, n$. Overall we have the
following result.

Lemma 1. On the linear function $f(x) = x_1$, the selected steps
$(\xi_t^\star)_{t\in\mathbb{N}}$ of the $(1,\lambda)$-ES are i.i.d. and distributed
according to the vector $\xi := (\mathcal{N}_{1:\lambda}, \mathcal{N}_2, \ldots, \mathcal{N}_n)$
where $\mathcal{N}_i \sim \mathcal{N}(0,1)$ for $i \ge 2$.
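Lemma 1 can be illustrated with a small Monte Carlo experiment. The sketch below assumes $\lambda = 2$, for which $E(\mathcal{N}_{1:2}) = -1/\sqrt{\pi}$ is a known closed form; the helper names are hypothetical.

```python
import math
import random


def sample_selected_step(n, lam, rng):
    """Draw lam steps from N(0, I_n) and return the one with the smallest first coordinate."""
    steps = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(lam)]
    return min(steps, key=lambda xi: xi[0])


def empirical_means(n, lam, samples, seed=0):
    """Coordinate-wise empirical mean of the selected step."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(samples):
        xi = sample_selected_step(n, lam, rng)
        for j in range(n):
            sums[j] += xi[j]
    return [s / samples for s in sums]


if __name__ == "__main__":
    means = empirical_means(n=3, lam=2, samples=20000, seed=1)
    # First coordinate is biased towards negative values, the others are not.
    print(means, -1.0 / math.sqrt(math.pi))
```

Only the first coordinate shows a selection bias; the remaining coordinates keep a mean close to zero.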

Because the selected steps $\xi_t^\star$ are i.i.d., the path defined in Eq. (2) is an
autonomous Markov chain, that we will denote $\mathcal{P} = (p_t)_{t\in\mathbb{N}}$. Note
that if the distribution of the selected step depended on $(X_t, \sigma_t)$, as is
generally the case on non-linear functions, then the path alone would not be a Markov
chain; however $(X_t, \sigma_t, p_t)$ would be an autonomous Markov chain. In order to
study whether the $(1,\lambda)$-CSA-ES diverges geometrically, we investigate the log of
the step-size change, whose formula can be immediately deduced from Eq. (3):

$$\ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right) = \frac{c}{2d_\sigma}\left(\frac{\|p_{t+1}\|^2}{n} - 1\right) . \qquad (4)$$

By summing up this equation from $0$ to $t-1$ we obtain

$$\frac{1}{t}\ln\left(\frac{\sigma_t}{\sigma_0}\right) = \frac{c}{2d_\sigma}\left(\frac{1}{t}\sum_{k=1}^{t}\frac{\|p_k\|^2}{n} - 1\right) . \qquad (5)$$

We are interested to know whether $\frac{1}{t}\ln(\sigma_t/\sigma_0)$ converges to a
constant. In case this constant is positive, this will prove that the
$(1,\lambda)$-CSA-ES diverges geometrically. We recognize thanks to (5) that this quantity
is equal to the sum of $t$ terms divided by $t$, which suggests the use of the law of large
numbers to prove convergence of (5). We will start by investigating the case without
cumulation $c = 1$ (Section 3) and then the case with cumulation (Section 4).



3 Divergence rate of $(1,\lambda)$-CSA-ES without cumulation



In this section we study the $(1,\lambda)$-CSA-ES without cumulation, i.e. $c = 1$. In
this case, the path always equals the selected step, i.e. for all $t$, we have
$p_{t+1} = \xi_t^\star$. We have proven in Lemma 1 that the $\xi_t^\star$ are i.i.d.
according to $\xi$. This allows us to use the standard law of large numbers to find the
limit of $\frac{1}{t}\ln(\sigma_t/\sigma_0)$ as well as compute the expected log-step-size
change.

Proposition 1. Let $\Delta_\sigma := \frac{1}{2d_\sigma n}\left(E\left(\mathcal{N}_{1:\lambda}^2\right) - 1\right)$.
On linear functions, the $(1,\lambda)$-CSA-ES without cumulation satisfies (i) almost
surely $\lim_{t\to\infty}\frac{1}{t}\ln(\sigma_t/\sigma_0) = \Delta_\sigma$, and (ii) for
all $t \in \mathbb{N}$, $E(\ln(\sigma_{t+1}/\sigma_t)) = \Delta_\sigma$.
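The rate $\Delta_\sigma$ can be evaluated numerically. The sketch below is an illustration, not the authors' code: it integrates the density of the minimum of $\lambda$ standard normal variables, $\lambda\,\varphi(x)(1-\Phi(x))^{\lambda-1}$, with a trapezoidal rule whose bounds and resolution are arbitrary choices.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def second_moment_min(lam, lo=-12.0, hi=12.0, steps=24000):
    """E(N_{1:lam}^2) by the trapezoidal rule; the minimum of lam i.i.d. N(0,1)
    variables has density lam * pdf(x) * (1 - cdf(x)) ** (lam - 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * x * x * lam * normal_pdf(x) * (1.0 - normal_cdf(x)) ** (lam - 1)
    return total * h


def delta_sigma(lam, n, d_sigma=1.0):
    """Rate (E(N_{1:lam}^2) - 1) / (2 d_sigma n) of Proposition 1."""
    return (second_moment_min(lam) - 1.0) / (2.0 * d_sigma * n)


if __name__ == "__main__":
    for lam in (1, 2, 3, 5, 10):
        print(lam, second_moment_min(lam), delta_sigma(lam, n=10))
```

For $\lambda = 1$ and $\lambda = 2$ the computed rate is zero up to integration error, and it becomes positive from $\lambda = 3$ on.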

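Proof. We have identified in Lemma 1 that the first coordinate of $\xi_t^\star$ is
distributed according to $\mathcal{N}_{1:\lambda}$ and the other coordinates according to
$\mathcal{N}(0,1)$, hence
$E(\|\xi_t^\star\|^2) = E([\xi_t^\star]_1^2) + \sum_{i=2}^n E([\xi_t^\star]_i^2) = E(\mathcal{N}_{1:\lambda}^2) + n - 1$.
Therefore $E(\|\xi_t^\star\|^2)/n - 1 = (E(\mathcal{N}_{1:\lambda}^2) - 1)/n$. By applying
this to Eq. (4), we deduce that
$E(\ln(\sigma_{t+1}/\sigma_t)) = 1/(2d_\sigma n)(E(\mathcal{N}_{1:\lambda}^2) - 1)$.
Furthermore, as
$E(\mathcal{N}_{1:\lambda}^2) \le E((\lambda\mathcal{N}(0,1))^2) = \lambda^2 < \infty$, we
have $E(\|\xi_t^\star\|^2) < \infty$. The sequence $(\|\xi_t^\star\|^2)_{t\in\mathbb{N}}$
being i.i.d. according to Lemma 1, and being integrable as we just showed, we can apply
the strong law of large numbers on Eq. (5). We obtain

$$\frac{1}{t}\ln\left(\frac{\sigma_t}{\sigma_0}\right) = \frac{1}{2d_\sigma}\left(\frac{1}{t}\sum_{k=0}^{t-1}\frac{\|\xi_k^\star\|^2}{n} - 1\right) \xrightarrow[t\to\infty]{a.s.} \frac{1}{2d_\sigma}\left(\frac{E(\|\xi^\star\|^2)}{n} - 1\right) = \frac{1}{2d_\sigma n}\left(E\left(\mathcal{N}_{1:\lambda}^2\right) - 1\right) .$$

□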



The proposition reveals that the sign of $(E(\mathcal{N}_{1:\lambda}^2) - 1)$ determines
whether the step-size diverges to infinity. In the following, we show that
$E(\mathcal{N}_{1:\lambda}^2)$ increases in $\lambda$ for $\lambda \ge 2$ and that the
$(1,\lambda)$-ES diverges for $\lambda \ge 3$. For $\lambda = 1$ and $\lambda = 2$, the
step-size follows a random walk on the log-scale. To prove this we need the following
lemma:

Lemma 2 ([ ]). Let $g$ be a real valued function on $\mathbb{R}$. For $\lambda \ge 2$,

$$(\lambda + 1)\,E\left(g(\mathcal{N}_{1:\lambda})\right) = E\left(g(\mathcal{N}_{2:\lambda+1})\right) + \lambda\,E\left(g(\mathcal{N}_{1:\lambda+1})\right) . \qquad (6)$$
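The recurrence of Lemma 2 can be checked numerically for a given $g$. The following sketch assumes the standard density of the $i$-th order statistic of $\lambda$ i.i.d. standard normal variables and integrates it with a trapezoidal rule; `order_stat_expectation` and `lemma2_gap` are illustrative names.

```python
import math


def order_stat_expectation(g, i, lam, lo=-12.0, hi=12.0, steps=24000):
    """E(g(N_{i:lam})) using the order-statistic density
    lam! / ((i-1)! (lam-i)!) * cdf^(i-1) * (1-cdf)^(lam-i) * pdf."""
    coef = math.factorial(lam) / (math.factorial(i - 1) * math.factorial(lam - i))
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        w = 0.5 if k in (0, steps) else 1.0
        total += w * g(x) * coef * cdf ** (i - 1) * (1.0 - cdf) ** (lam - i) * pdf
    return total * h


def lemma2_gap(g, lam):
    """Left-hand side minus right-hand side of Eq. (6); should be close to zero."""
    lhs = (lam + 1) * order_stat_expectation(g, 1, lam)
    rhs = order_stat_expectation(g, 2, lam + 1) + lam * order_stat_expectation(g, 1, lam + 1)
    return lhs - rhs


if __name__ == "__main__":
    for lam in (2, 3, 5):
        print(lam, lemma2_gap(lambda x: x * x, lam), lemma2_gap(lambda x: x, lam))
```

Taking $g$ as the square function corresponds to the use made of the lemma in the proof of Lemma 3.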

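Proof of Lemma 2.

This method can be found with more details in [12].

Let $\chi_i = g(\xi_i)$, and $\chi_{i:\lambda} = g(\xi_{i:\lambda})$. Note that in general
$\chi_{1:\lambda} \ne \min_{i\in[[1,\lambda]]}\chi_i$. The sorting is made on $(\xi_i)$,
not on $(\chi_i)$.

We will also note $\chi_{i:\lambda}^{\{j\}}$ the $i$-th order statistic after the variable
$\chi_j$ has been taken away, and $\chi_{i:\lambda}^{[j]}$ the $i$-th order statistic after
$\chi_{j:\lambda}$ has been taken away: if $i \ne 1$ then we have
$\chi_{1:\lambda}^{[i]} = \chi_{1:\lambda}$, and for $i = 1$,
$\chi_{1:\lambda}^{[1]} = \chi_{2:\lambda}$.

Then we have $E\left(\chi_{1:\lambda}^{\{i\}}\right) = E(\chi_{1:\lambda-1})$,
and $\sum_{i=1}^{\lambda}\chi_{1:\lambda}^{\{i\}} = \sum_{i=1}^{\lambda}\chi_{1:\lambda}^{[i]}$.

From the first equation we deduce that
$\lambda E(\chi_{1:\lambda-1}) = \sum_{i=1}^{\lambda}E\left(\chi_{1:\lambda}^{\{i\}}\right)$.
With the second equation, we get that
$E\left(\sum_{i=1}^{\lambda}\chi_{1:\lambda}^{\{i\}}\right) = E\left(\sum_{i=1}^{\lambda}\chi_{1:\lambda}^{[i]}\right) = E(\chi_{2:\lambda}) + (\lambda - 1)E(\chi_{1:\lambda})$.

By combining both, we get the final equation:

$$(\lambda - 1)\left(E(\chi_{1:\lambda}) - E(\chi_{1:\lambda-1})\right) = E(\chi_{1:\lambda-1}) - E(\chi_{2:\lambda})$$

□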

We are now ready to prove the following result. 

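Lemma 3. Let $(\mathcal{N}_i)_{i\in[[1,\lambda]]}$ be independent random variables,
distributed according to $\mathcal{N}(0,1)$, and $\mathcal{N}_{i:\lambda}$ the $i$-th
order statistic of $(\mathcal{N}_i)_{i\in[[1,\lambda]]}$. Then
$E(\mathcal{N}_{1:1}^2) = E(\mathcal{N}_{1:2}^2) = 1$. In addition, for all
$\lambda \ge 2$, $E(\mathcal{N}_{1:\lambda+1}^2) > E(\mathcal{N}_{1:\lambda}^2)$.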

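Proof of Lemma 3. Showing the strict monotony of $E(\mathcal{N}_{1:\lambda}^2)$ in
$\lambda$ is equivalent to showing that
$E(\mathcal{N}_{1:\lambda}^2) > E(\mathcal{N}_{2:\lambda}^2)$ for $\lambda \ge 3$. Indeed
$E(\mathcal{N}_{1:\lambda}^2) - E(\mathcal{N}_{1:\lambda-1}^2) = E(\mathcal{N}_{1:\lambda}^2 - \mathcal{N}_{2:\lambda}^2)/\lambda$,
which follows from Lemma 2 taking $g$ as the square function.

Let $E_1 = \{\omega \in \Omega \mid \mathcal{N}_{1:\lambda}^2(\omega) < \mathcal{N}_{2:\lambda}^2(\omega)\}$,
where $\Omega = \mathbb{R}^\lambda$ and $P(\omega) = \exp(-\|\omega\|^2/2)/(2\pi)^{\lambda/2}$.
For $\omega \in \Omega$, let us note $\omega_{i:\lambda}$ the $i$-th order statistic of
$([\omega]_j)_{j\in[[1,\lambda]]}$. Let $g$ be a function that maps $\omega \in \Omega$ to
$\bar\omega \in \Omega$, where $\bar\omega_{1:\lambda} = -\omega_{2:\lambda}$,
$\bar\omega_{2:\lambda} = \omega_{1:\lambda}$ and for $i \ge 3$,
$\bar\omega_{i:\lambda} = \omega_{i:\lambda}$. The function $g$ is bijective between $E_1$
and its image by $g$, $E_2$. Let us note that for $\omega \in E_1$,
$\mathcal{N}_{2:\lambda}^2(\omega) - \mathcal{N}_{1:\lambda}^2(\omega) = \mathcal{N}_{1:\lambda}^2(g(\omega)) - \mathcal{N}_{2:\lambda}^2(g(\omega))$,
and $P(\omega) = P(g(\omega))$ since the standard normal distribution is symmetric. That is
$\int_{E_1}(\mathcal{N}_{2:\lambda}^2(\omega) - \mathcal{N}_{1:\lambda}^2(\omega))P(\omega)d\omega = \int_{E_2}(\mathcal{N}_{1:\lambda}^2(\bar\omega) - \mathcal{N}_{2:\lambda}^2(\bar\omega))P(\bar\omega)d\bar\omega$
by a change of variables $\bar\omega = g(\omega)$. As according to the definition of $E_1$,
for all $\omega \in \Omega\setminus E_1$,
$\mathcal{N}_{1:\lambda}^2(\omega) \ge \mathcal{N}_{2:\lambda}^2(\omega)$, and as $E_1$ is
properly counterweighted by $E_2$ in the expected value of
$\mathcal{N}_{1:\lambda}^2 - \mathcal{N}_{2:\lambda}^2$, we do have
$E(\mathcal{N}_{1:\lambda}^2) \ge E(\mathcal{N}_{2:\lambda}^2)$ for all $\lambda \ge 2$.

For $\lambda \ge 3$, let
$E_3 = \{\omega \in \Omega \mid \omega_{3:\lambda} \in\, ]-|\omega_{1:\lambda}|, |\omega_{1:\lambda}|[ \text{ and } \omega_{1:\lambda} < \omega_{2:\lambda}\}$.
Then, for $\omega \in E_3$ we also have
$\omega_{2:\lambda} \in\, ]-|\omega_{1:\lambda}|, |\omega_{1:\lambda}|[$, so
$\mathcal{N}_{1:\lambda}^2(\omega) > \mathcal{N}_{2:\lambda}^2(\omega)$, which means
$\omega \notin E_1$, or $E_1 \cap E_3 = \emptyset$. For $\omega \in E_1$, as
$\omega_{1:\lambda}^2 < \omega_{2:\lambda}^2$ and $\omega_{1:\lambda} \le \omega_{2:\lambda}$,
$\omega_{2:\lambda} > 0$, so $\omega_{3:\lambda} \notin [-\omega_{2:\lambda}, \omega_{2:\lambda}]$.
Hence, as $g(\omega)_{3:\lambda} = \omega_{3:\lambda}$ and
$[-\omega_{2:\lambda}, \omega_{2:\lambda}] = [-|g(\omega)_{1:\lambda}|, |g(\omega)_{1:\lambda}|]$,
$g(\omega) \notin E_3$. That is $E_2 \cap E_3 = \emptyset$. So $E_3$ is disjoint with $E_1$
and $E_2$. Furthermore, for every $\omega \in \Omega$, except when $\omega_{1:\lambda} = 0$
which is a negligible subset of events, there exists a non negligible set of
$(\omega_{i:\lambda})_{i\in[[1,\lambda]]}$ such that
$\omega_{3:\lambda} \in\, ]-|\omega_{1:\lambda}|, |\omega_{1:\lambda}|[$ and
$\omega_{1:\lambda} < \omega_{2:\lambda}$. So $E_3$ is a non negligible subset of $\Omega$,
where $\mathcal{N}_{1:\lambda}^2(\omega) > \mathcal{N}_{2:\lambda}^2(\omega)$. Hence
$E(\mathcal{N}_{1:\lambda}^2) > E(\mathcal{N}_{2:\lambda}^2)$, which is the monotony of
Lemma 3.

For $\lambda = 1$, $\mathcal{N}_{1:1} \sim \mathcal{N}(0,1)$ so $E(\mathcal{N}_{1:1}^2) = 1$.
For $\lambda = 2$ we have
$E(\mathcal{N}_{1:2}^2 + \mathcal{N}_{2:2}^2) = 2E(\mathcal{N}(0,1)^2) = 2$, and since the
normal distribution is symmetric $E(\mathcal{N}_{1:2}^2) = E(\mathcal{N}_{2:2}^2)$, hence
$E(\mathcal{N}_{1:2}^2) = 1$. □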



We can now link Proposition 1 and Lemma 3 into the following theorem: 



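Theorem 1. On linear functions, for $\lambda \ge 3$, the step-size of the
$(1,\lambda)$-CSA-ES without cumulation ($c = 1$) diverges geometrically almost surely and
in expectation at the rate $1/(2d_\sigma n)(E(\mathcal{N}_{1:\lambda}^2) - 1)$, i.e.

$$\frac{1}{t}\ln\left(\frac{\sigma_t}{\sigma_0}\right) \xrightarrow[t\to\infty]{a.s.} E\left(\ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right)\right) = \frac{1}{2d_\sigma n}\left(E\left(\mathcal{N}_{1:\lambda}^2\right) - 1\right) . \qquad (7)$$

For $\lambda = 1$ and $\lambda = 2$, without cumulation, the logarithm of the step-size
does an additive unbiased random walk, i.e. $\ln\sigma_{t+1} = \ln\sigma_t + W_t$ where
$E[W_t] = 0$. More precisely $W_t \sim 1/(2d_\sigma)(\chi_n^2/n - 1)$ for $\lambda = 1$,
and $W_t \sim 1/(2d_\sigma)((\mathcal{N}_{1:2}^2 + \chi_{n-1}^2)/n - 1)$ for $\lambda = 2$,
where $\chi_k^2$ stands for the chi-squared distribution with $k$ degrees of freedom.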

Proof. For $\lambda > 2$, from Lemma 3 we know that
$E(\mathcal{N}_{1:\lambda}^2) > E(\mathcal{N}_{1:2}^2) = 1$. Therefore
$E(\mathcal{N}_{1:\lambda}^2) - 1 > 0$, hence Eq. (7) is strictly positive, and with
Proposition 1 we get that the step-size diverges geometrically almost surely at the rate
$1/(2d_\sigma n)(E(\mathcal{N}_{1:\lambda}^2) - 1)$.

With Eq. (4) we have $\ln(\sigma_{t+1}) = \ln(\sigma_t) + W_t$, with
$W_t = 1/(2d_\sigma)(\|\xi_t^\star\|^2/n - 1)$. For $\lambda = 1$ and $\lambda = 2$,
according to Lemma 3, $E(W_t) = 0$. Hence $\ln(\sigma_t)$ does an additive unbiased random
walk. Furthermore $\|\xi\|^2 = \mathcal{N}_{1:\lambda}^2 + \chi_{n-1}^2$, so for
$\lambda = 1$, since $\mathcal{N}_{1:1} = \mathcal{N}(0,1)$, $\|\xi\|^2 = \chi_n^2$. □



3.1 Geometric divergence of $([X_t]_1)_{t\in\mathbb{N}}$

As the selection occurs only on the first dimension, if there is geometric divergence for
$X_t$, it is on $[X_t]_1$. From Eq. (1),

$$\ln\left|\frac{[X_{t+1}]_1}{[X_t]_1}\right| = \ln\left|1 + \frac{\sigma_t}{[X_t]_1}[\xi_t^\star]_1\right| .$$

Summing the previous equation from $0$ till $t-1$ and dividing by $t$ gives us that

$$\frac{1}{t}\ln\left|\frac{[X_t]_1}{[X_0]_1}\right| = \frac{1}{t}\sum_{k=0}^{t-1}\ln\left|1 + \frac{\sigma_k}{[X_k]_1}[\xi_k^\star]_1\right| . \qquad (8)$$



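Although it is not obvious at first sight, it is important to take the logarithm, as we
intuitively know that the speed of $\sigma_t$ and the speed of $X_t$ are connected. The
divergence rate of $\sigma_t$ being log-linear, so should be the one of $X_t$. Let
$Z_{-1} = 0$, and $Z_t = \frac{[X_{t+1}]_1 - [X_0]_1}{\sigma_t}$ for $t \ge 0$. Then

$$Z_{t+1} = \frac{[X_{t+2}]_1 - [X_0]_1}{\sigma_{t+1}} = \frac{[X_{t+1}]_1 - [X_0]_1 + \sigma_{t+1}[\xi_{t+1}^\star]_1}{\sigma_{t+1}} = \frac{Z_t}{\eta_t^\star} + [\xi_{t+1}^\star]_1 ,$$

using that $\sigma_{t+1} = \sigma_t\eta_t^\star$. According to Lemma 1,
$(\xi_t^\star)_{t\in\mathbb{N}}$ is independent over time. As
$\eta_t^\star = \exp((\|\xi_t^\star\|^2/n - 1)/(2d_\sigma))$,
$(\eta_t^\star)_{t\in\mathbb{N}}$ is also independent over time. Therefore,
$Z = (Z_t)_{t\in\mathbb{N}}$ is a Markov chain.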



By introducing $Z$ in Eq. (8), we obtain:

$$\frac{1}{t}\ln\left|\frac{[X_t]_1}{[X_0]_1}\right| = \frac{1}{t}\sum_{k=0}^{t-1}\left(\ln|Z_k| - \ln|Z_{k-1}| + \ln|\eta_{k-1}^\star|\right) . \qquad (9)$$



The right hand side of this equation reminds us again of the law of large numbers. There 
is no independence over time, but Z being a Markov chain, if it follows some specific 
stability properties of Markov chains, then a law of large numbers may apply. 



Study of the Markov chain Z To apply a law of large numbers to a Markov chain, it has to
satisfy some stability properties: in particular, the Markov chain $\mathcal{P}$ has to
be $\varphi$-irreducible, that is, there exists a measure $\varphi$ such that every Borel
set $A$ of $\mathbb{R}^n$ with $\varphi(A) > 0$ has a positive probability to be reached in
a finite number of steps by $\mathcal{P}$ starting from any $p \in \mathbb{R}^n$. In
addition, the chain $\mathcal{P}$ needs to be (i) positive, that is the chain admits an
invariant probability measure $\pi$, i.e., for any Borel set $A$,
$\pi(A) = \int_{\mathbb{R}^n}P(x, A)\pi(dx)$ with $P(x, A)$ being the probability to
transition in one time step from $x$ into $A$, and (ii) Harris recurrent, which means that
for any Borel set $A$ such that $\varphi(A) > 0$, the chain $\mathcal{P}$ visits $A$ an
infinite number of times with probability one. Under those conditions, $\mathcal{P}$
satisfies a law of large numbers, more precisely:

Lemma 4. [11, 17.0.1] Suppose that $\Phi$ is a positive Harris chain with stationary
measure $\pi$, and let $g$ be a $\pi$-integrable function, that is such that
$\pi(|g|) = \int_{\mathbb{R}^n}|g(x)|\pi(dx) < \infty$. Then

$$\frac{1}{t}\sum_{k=1}^{t}g(\Phi_k) \xrightarrow[t\to\infty]{a.s.} \pi(g) . \qquad (10)$$

To show that a Markov chain defined in a space $X$ is positive Harris recurrent, we
generally show that the chain follows a so-called drift condition over a small set, that
is for a function $V$, an inequality over the drift operator
$\Delta V : x \mapsto \int_X V(y)P(x, dy) - V(x)$. A small set $C$ is a Borel set such that
there exists an $m \in \mathbb{N}^*$ and a non-trivial measure $\nu_m$ on $\mathcal{B}(X)$
such that for all $x \in C$, $B \in \mathcal{B}(X)$, $P^m(x, B) \ge \nu_m(B)$. The set $C$
is then called a $\nu_m$-small set. The chain also needs to be aperiodic, that is there is
no $d$-cycle, that is disjoint Borel sets $(D_i)_{i\in[[1,d]]}$ such that for $x \in D_i$,
$P(x, D_{i+1}) = 1$ for $i = 0, \ldots, d-1 \pmod d$, and $[\cup_{i=1}^d D_i]^c$ is
$\psi$-negligible. If there exists a $\nu_1$-small set $A$ such that $\nu_1(A) > 0$, then
the chain is strongly aperiodic (and therefore aperiodic). We then have the following
lemma.



Lemma 5. [11, 14.0.1] Suppose that the chain $\Phi$ is $\psi$-irreducible and aperiodic,
and $f \ge 1$ a function on $X$. Let us assume that there exists $V$ some extended-valued
non-negative function finite for some $x_0 \in X$, a small set $C$ and $b \in \mathbb{R}$
such that

$$\Delta V(x) \le -f(x) + b1_C(x) , \quad x \in X . \qquad (11)$$

Then the chain $\Phi$ is positive Harris recurrent with invariant probability measure
$\pi$ and

$$\pi(f) = \int_X \pi(dx)f(x) < \infty . \qquad (12)$$

To prove the irreducibility, aperiodicity and to exhibit the small sets of the Markov
chain $Z$ through its transition kernel would be difficult. Instead, it can be done by
showing some properties of its underlying control model. In our case, the model associated
to $Z$ is called a non-linear state space model. We will, in the following, define this
non-linear state space model and some of its properties.

Suppose $\mathbf{X} = \{X_k\}$, $X_k \in X$. If there is a smooth function ($C^\infty$) $F$
such that $X_{k+1} = F(X_k, W_{k+1})$ with $(W_i)_{i\in\mathbb{N}}$ being a sequence of
i.i.d. random variables, whose marginal distribution $\Gamma$ possesses a lower
semi-continuous density $\gamma_w$ which is supported on an open set $O_w$; then
$\mathbf{X}$ is called a non-linear state space model driven by $F$ or NSS($F$) model, with
control set $O_w$.

We define its associated control model CM($F$) as the deterministic system
$x_k = F_k(x_0, u_1, \ldots, u_k)$, where $F_k$ is given by
$F_k(x_0, u_1, \ldots, u_k) = F(F_{k-1}(x_0, u_1, \ldots, u_{k-1}), u_k)$, and
$F_0(x_0) = x_0$, provided that $(u_1, \ldots, u_k)$ lies in the control set $O_w$.

For a point $x \in X$, and $k \in \mathbb{N}$ we define
$A_+^k(x) = \{F_k(x, u_1, \ldots, u_k) \mid u_i \in O_w, \forall i \in \mathbb{N}\}$, the
set of points reachable from $x$ after $k$ steps of time. And
$A_+(x) = \bigcup_{i\in\mathbb{N}}A_+^i(x)$.

The associated control model CM($F$) is called forward accessible if for each
$x_0 \in X$, the set $A_+(x_0)$ has non-empty interior.

Let $E$ be a subset of $X$. We note $A_+(E) = \bigcup_{x\in E}A_+(x)$, and we say that $E$
is invariant if $A_+(E) \subset E$. We call a set minimal if it is closed, invariant, and
does not strictly contain any closed and invariant subset. Restricted to a minimal set, a
Markov chain has strong properties, as stated in the following lemma.

Lemma 6. [11, 7.2.4, 7.3.5] Let $M \subset X$ be a minimal set for CM($F$). If CM($F$) is
forward accessible then the NSS($F$) model restricted to $M$ is an open set irreducible
T-chain.

Furthermore, if the control set $O_w$ and $M$ are connected, and $M$ is the unique
minimal set of the CM($F$), then the NSS($F$) model is a $\psi$-irreducible aperiodic
T-chain for which every compact set is a small set.

We can now prove the following lemma:

Lemma 7. The Markov chain $Z$ is open set and $\psi$-irreducible, aperiodic, and compacts
of $\mathbb{R}$ are small-sets.

Proof. This is exactly the result of Lemma 6 when all conditions are fulfilled. We then
have to show the right properties of the underlying control model.

If we note
$F(X_k, W_{k+1}) = X_k\exp\left(-\frac{1}{2d_\sigma}\left(\|W_{k+1}\|^2/n - 1\right)\right) + [W_{k+1}]_1$,
then we do have $Z_{t+1} = F(Z_t, \xi_t^\star)$. The function $F$ is smooth (it is not
smooth along the instances $\xi_{t,i}$, but along the chosen step $\xi_t^\star$).
Furthermore, the distribution of $\xi_t^\star$ admits a continuous density, whose support
is $\mathbb{R}^n$. Therefore the process $Z$ is a NSS($F$) model of control set
$\mathbb{R}^n$.

We now have to show that the associated control model is forward accessible. Let
$z \in \mathbb{R}$. When $[\xi_t^\star]_1 \to \pm\infty$, $F(z, \xi_t^\star) \to \pm\infty$.
As $F$ is continuous, for the right value of $[\xi_t^\star]_1$ any point of $\mathbb{R}$
can be reached. Therefore for any $z \in \mathbb{R}$, $A_+(z) = \mathbb{R}$. The set
$\mathbb{R}$ has a non-empty interior, so the CM($F$) is forward accessible.

As from any point of $\mathbb{R}$, all of $\mathbb{R}$ can be reached, the only invariant
set is $\mathbb{R}$ itself. It is therefore the only minimal set. Finally, the control set
$O_w = \mathbb{R}^n$ is connected, and so is the only minimal set, so all the conditions of
Lemma 6 are met. So the Markov chain $Z$ is $\psi$-irreducible, aperiodic, and compacts of
$\mathbb{R}$ are small-sets. □
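The control model used in this proof can be made concrete with a short sketch. It assumes the map $F(z, w) = z\exp(-(\|w\|^2/n - 1)/(2d_\sigma)) + [w]_1$ from the proof; `F_k` is a hypothetical helper that composes $F$ along a deterministic control sequence, as in the definition of CM($F$).

```python
import math
from functools import reduce


def F(z, w, d_sigma=1.0):
    """One step of the non-linear state space model driving Z (control w in R^n)."""
    n = len(w)
    sq_norm = sum(v * v for v in w)
    return z * math.exp(-(sq_norm / n - 1.0) / (2.0 * d_sigma)) + w[0]


def F_k(z0, controls, d_sigma=1.0):
    """k-step map of the control model: F_0(z0) = z0 and F_k = F(F_{k-1}, u_k)."""
    return reduce(lambda z, u: F(z, u, d_sigma), controls, z0)


if __name__ == "__main__":
    # Moving the first control coordinate far out drives the state towards +/- infinity.
    for u in (-50.0, -5.0, 0.0, 5.0, 50.0):
        print(u, F(1.0, [u, 0.0, 0.0]))
```

Scanning the first control coordinate shows the state sweeping from large negative to large positive values, which is the intuition behind forward accessibility here.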

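We may now show Foster-Lyapunov drift conditions to ensure the Harris positive
recurrence of the chain $Z$. In order to do so, we will need the following lemma:

Lemma 8. Let $\exp\left(-\frac{1}{2d_\sigma}\left(\frac{\|\xi^\star\|^2}{n} - 1\right)\right)$
be denoted $\eta^{\star-1}$. For all $\lambda > 2$ there exists $\alpha > 0$ such that

$$E\left(\eta^{\star-\alpha}\right) - 1 < 0 . \qquad (13)$$

Proof. Using the Taylor series of the exponential function we have

$$E\left(\eta^{\star-\alpha}\right) = E\left(\exp\left(-\frac{\alpha}{2d_\sigma}\left(\frac{\|\xi^\star\|^2}{n} - 1\right)\right)\right) = 1 - \frac{\alpha}{2d_\sigma n}\left(E\left(\mathcal{N}_{1:\lambda}^2\right) - 1\right) + o(\alpha) .$$

According to Lemma 3, $E(\mathcal{N}_{1:\lambda}^2) > 1$ for $\lambda > 2$, so when
$\alpha$ goes to $0$ we have $E(\eta^{\star-\alpha}) < 1$. □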
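The quantity in Eq. (13) can also be evaluated numerically. The sketch below is an illustration under the assumption $c = 1$: it factorizes $E(\eta^{\star-\alpha})$ into a one-dimensional integral over the selected coordinate and the closed-form Laplace transform $(1+2a)^{-(n-1)/2}$ of a $\chi^2_{n-1}$ variable, with $a = \alpha/(2d_\sigma n)$. Function names and integration settings are arbitrary choices.

```python
import math


def expect_exp_neg_min_sq(a, lam, lo=-12.0, hi=12.0, steps=24000):
    """E(exp(-a * N_{1:lam}^2)) by the trapezoidal rule, using the density
    lam * pdf(x) * (1 - cdf(x)) ** (lam - 1) of the minimum of lam N(0,1) variables."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        w = 0.5 if k in (0, steps) else 1.0
        total += w * math.exp(-a * x * x) * lam * pdf * (1.0 - cdf) ** (lam - 1)
    return total * h


def expected_eta_power(alpha, lam, n, d_sigma=1.0):
    """E(eta^(-alpha)) with eta = exp((||xi||^2 / n - 1) / (2 d_sigma))."""
    a = alpha / (2.0 * d_sigma * n)
    selected = expect_exp_neg_min_sq(a, lam)          # coordinate under selection
    others = (1.0 + 2.0 * a) ** (-(n - 1) / 2.0)      # n - 1 unbiased coordinates
    return math.exp(alpha / (2.0 * d_sigma)) * selected * others


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.1):
        print(alpha, expected_eta_power(alpha, lam=3, n=10), expected_eta_power(alpha, lam=2, n=10))
```

In this example the value drops below one for $\lambda = 3$ and small $\alpha$, while for $\lambda = 2$ it stays above one, in line with the restriction $\lambda > 2$ in the lemma.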

We are now ready to prove the following lemma: 

Lemma 9. The Markov chain $Z$ is Harris recurrent positive, and admits a unique invariant
measure $\mu$ such that for $f : x \mapsto |x|^\alpha \in \mathbb{R}$,
$\mu(f) = \int_{\mathbb{R}}\mu(dx)f(x) < \infty$, with $\alpha$ such that Eq. (13) holds
true.



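Proof. By using Lemma 7 and Lemma 5, we just need the drift condition (11) to prove
Lemma 9. Let $V$ be such that for $x \in \mathbb{R}$, $V(x) = |x|^\alpha + 1$.

$$\Delta V(x) = \int_{\mathbb{R}}P(x, dy)V(y) - V(x)$$
$$= \int_{\mathbb{R}}P\left(\frac{x}{\eta^\star} + [\xi^\star]_1 \in dy\right)(1 + |y|^\alpha) - (1 + |x|^\alpha)$$
$$\le |x|^\alpha E\left(\eta^{\star-\alpha} - 1\right) + E\left(|[\xi^\star]_1|^\alpha\right)$$
$$\frac{\Delta V(x)}{V(x)} \le \frac{|x|^\alpha}{1 + |x|^\alpha}E\left(\eta^{\star-\alpha} - 1\right) + \frac{1}{1 + |x|^\alpha}E\left(|[\xi^\star]_1|^\alpha\right)$$
$$\lim_{|x|\to\infty}\frac{\Delta V(x)}{V(x)} \le E\left(\eta^{\star-\alpha} - 1\right)$$

We take $\alpha$ such that Eq. (13) holds true (as according to Lemma 8, there exists
such an $\alpha$). As $E(\eta^{\star-\alpha} - 1) < 0$, there exist $\epsilon > 0$ and
$M > 0$ such that for all $|x| \ge M$, $\Delta V/V(x) \le -\epsilon$. Let $b$ be equal to
$E(|[\xi^\star]_1|^\alpha) + \epsilon V(M)$. Then for all $|x| \le M$,
$\Delta V(x) \le -\epsilon V(x) + b$. Therefore, if we note $C = [-M, M]$, which is
according to Lemma 7 a small-set, we do have
$\Delta V(x) \le -\epsilon V(x) + b1_C(x)$, which is Eq. (11) with $f = \epsilon V$.
Therefore from Lemma 5 the chain $Z$ is positive Harris recurrent with invariant
probability measure $\mu$, and $\epsilon V$ is $\mu$-integrable. As
$\int_{\mathbb{R}}\mu(dx)|x|^\alpha = 1/\epsilon\int_{\mathbb{R}}\mu(dx)\epsilon V(x) - 1 < \infty$,
the function $x \mapsto |x|^\alpha$ is also $\mu$-integrable.

□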

In order to use Lemma 4 on Z with the function g : x i-> E (hi\x/x — 
we must prove that this function is $\mu$-integrable, that is
$\int_{\mathbb{R}}g(u)\mu(du) < \infty$. To do so we will need the following lemma on the
existence of moments for stationary Markov chains:

Lemma 10. Let $Z$ be a Harris-recurrent Markov chain with stationary measure $\mu$, on a
state space $(S, \mathcal{F})$, with $\mathcal{F}$ a $\sigma$-field of subsets of $S$. Let
$f$ be a positive measurable function on $S$.

In order that $\int_S f(z)\mu(dz) < \infty$, it suffices that for some set
$A \in \mathcal{F}$ such that $0 < \mu(A)$ and $\int_A f(z)\mu(dz) < \infty$, and some
measurable function $g$ with $g(z) \ge f(z)$ for $z \in A^c$,

1. $\int_{A^c}P(z, dy)g(y) \le g(z) - f(z)$, $\forall z \in A^c$
2. $\sup_{z\in A}\int_{A^c}P(z, dy)g(y) < \infty$



We may now prove the following theorem: 



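Theorem 2. On linear functions, for $\lambda \ge 3$, the absolute value of the first
dimension of the parent point in the $(1,\lambda)$-CSA-ES without cumulation ($c = 1$)
diverges geometrically almost surely at the rate of
$1/(2d_\sigma n)(E(\mathcal{N}_{1:\lambda}^2) - 1)$, i.e.

$$\frac{1}{t}\ln\left|\frac{[X_t]_1}{[X_0]_1}\right| \xrightarrow[t\to\infty]{a.s.} \frac{1}{2d_\sigma n}\left(E\left(\mathcal{N}_{1:\lambda}^2\right) - 1\right) . \qquad (14)$$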



Proof. We will first prove here that the function $g : x \mapsto \ln|x|$ is
$\mu$-integrable. From Lemma 9 we know that the function $f : x \mapsto |x|^\alpha$ is
$\mu$-integrable, and as for any $M > 0$, and any $x \in [-M, M]^c$ there exists $K > 0$
such that $K|x|^\alpha > |\ln|x||$, then $g1_A$ is $\mu$-integrable, with
$A = [-M, M]^c$. So what is left is to prove that $g1_{A^c}$ is also $\mu$-integrable. We
will now check the conditions to use Lemma 10.

According to Lemma 7 the chain $Z$ is open-set irreducible, so $\mu(A) > 0$. For $C > 0$,
if we take $h : z \mapsto C/\sqrt{|z|}$, with $M$ small enough we do have for all
$z \in A^c$, $h(z) \ge |g(z)|$. Furthermore, if we study the inequality

$$\int_{A^c}P(z, dy)h(y) \le h(z) - |g(z)|$$
$$\int_{\mathbb{R}}P\left(\frac{z}{\eta^\star} + [\xi^\star]_1 \in dy\right)1_{A^c}(y)\frac{C}{\sqrt{|y|}} \le \frac{C}{\sqrt{|z|}} - |\ln|z||$$
$$E\left(\frac{1_{A^c}\left(z/\eta^\star + [\xi^\star]_1\right)}{\sqrt{|z/\eta^\star + [\xi^\star]_1|}}\right) \le \frac{1}{\sqrt{|z|}} - \frac{|\ln|z||}{C}$$

We can increase $C$ up until $|\ln|z||/C$ is negligible compared to $1/\sqrt{|z|}$, and we
can decrease $M$ to make
$E\left(1/\sqrt{|z/\eta^\star + [\xi^\star]_1|}\;1_{A^c}(z/\eta^\star + [\xi^\star]_1)\right)$
as small as we would like it to be, as it decreases the size of $A^c$, so the inequality
holds if we choose $M$ and $1/C$ small enough. The second inequality for Lemma 10 holds as
well:

$$\int_{A^c}P(u, dv)h(v) \le \int_{-M}^{M}\frac{C}{\sqrt{|v|}}dv = 4C\sqrt{M} < \infty$$

Finally, according to Lemma 9, the chain $Z$ is Harris recurrent. So Lemma 10 shows that
$g$ is $\mu$-integrable. This allows us to apply Lemma 4 to the function $g$:

$$\frac{1}{t}\sum_{k=0}^{t-1}\ln|Z_k| \xrightarrow[t\to\infty]{a.s.} \mu(g) .$$

With Lemma 1 we can apply a strong law of large numbers upon
$1/t\sum_{k=0}^{t-1}\ln|\eta_{k-1}^\star| = 1/t\sum_{k=0}^{t-1}1/(2d_\sigma)(\|\xi_{k-1}^\star\|^2/n - 1)$
to get, as in the proof of Proposition 1,
$1/(2d_\sigma n)(E(\mathcal{N}_{1:\lambda}^2) - 1)$.

By inserting these results into Eq. (9), we get that
$1/t\,\ln|[X_t]_1/[X_0]_1| \xrightarrow[t\to\infty]{a.s.} \mu(g) - \mu(g) + 1/(2d_\sigma n)(E(\mathcal{N}_{1:\lambda}^2) - 1)$,
which with Lemma 3 is strictly positive for $\lambda \ge 3$. □



4 Divergence rate of CSA-ES with cumulation 



We are now investigating the $(1,\lambda)$-CSA-ES with cumulation, i.e. $0 < c < 1$.

According to Lemma 1, the random variables $(\xi_t^\star)_{t\in\mathbb{N}}$ are i.i.d.,
hence the path $\mathcal{P} = (p_t)_{t\in\mathbb{N}}$ is a Markov chain. By a recurrence on
Eq. (2) we see that the path follows the following equation

$$p_t = (1-c)^t p_0 + \sqrt{c(2-c)}\sum_{k=0}^{t-1}(1-c)^k\underbrace{\xi_{t-1-k}^\star}_{\text{i.i.d.}} . \qquad (15)$$

For $i \ne 1$, $[\xi_t^\star]_i \sim \mathcal{N}(0,1)$ and, as also
$[p_0]_i \sim \mathcal{N}(0,1)$, by recurrence $[p_t]_i \sim \mathcal{N}(0,1)$ for all
$t \in \mathbb{N}$. For $i = 1$ with cumulation ($c < 1$), the influence of $[p_0]_1$
vanishes with $(1-c)^t$. Furthermore, as from Lemma 1 the sequence
$([\xi_t^\star]_1)_{t\in\mathbb{N}}$ is independent, we get by applying Kolmogorov's three
series theorem that the series $\sum_{k=0}^{t-1}(1-c)^k[\xi_{t-1-k}^\star]_1$ converges
almost surely. Therefore, the first component of the path becomes distributed as the
random variable $[p_\infty]_1 = \sqrt{c(2-c)}\sum_{k=0}^{\infty}(1-c)^k[\xi_k^\star]_1$ (by
re-indexing the variable $\xi_{t-1-k}^\star$ in $\xi_k^\star$, as the sequence
$(\xi_t^\star)_{t\in\mathbb{N}}$ is i.i.d.).

As in Subsection 3.1 we will show that $\mathcal{P}$ has the right stability properties
to apply a law of large numbers to it. First we will extract from $\mathcal{P}$ the part
of interest as stated in the following lemma.

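Lemma 11. On linear functions, for any $\lambda$ the step-size of the
$(1,\lambda)$-CSA-ES follows almost surely

$$\frac{1}{t}\ln\left(\frac{\sigma_t}{\sigma_0}\right) - \frac{c}{2d_\sigma n}\left(\frac{1}{t}\sum_{k=1}^{t}[p_k]_1^2 - 1\right) \xrightarrow[t\to\infty]{} 0 , \qquad (16)$$

and in expectancy

$$E\left(\ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right)\right) = \frac{c}{2d_\sigma n}\left(E\left([p_{t+1}]_1^2\right) - 1\right) . \qquad (17)$$

Proof. We separate Eq. (5) over the dimensions, which gives us that
$1/t\,\ln(\sigma_t/\sigma_0) = c/(2d_\sigma n)\left(\sum_{i=1}^n 1/t\sum_{k=1}^t[p_k]_i^2 - n\right)$,
so
$1/t\,\ln(\sigma_t/\sigma_0) - c/(2d_\sigma n)\left(1/t\sum_{k=1}^t[p_k]_1^2 - 1\right) = \sum_{i=2}^n c/(2d_\sigma n)\left(1/t\sum_{k=1}^t[p_k]_i^2 - 1\right)$.
As for $i \ne 1$, $[p_0]_i \sim \mathcal{N}(0,1)$ and $[\xi_t^\star]_i$ has no selection
pressure, then $[p_1]_i \sim \mathcal{N}(0,1)$, and per recurrence
$[p_k]_i \sim \mathcal{N}(0,1)$ for any $k \in \mathbb{N}$. Therefore, we can apply the
strong law of large numbers and $1/t\sum_{k=1}^t[p_k]_i^2 \to 1$, which gives us Eq. (16).

The same reasoning over Eq. (4) gives Eq. (17). □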

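The part of $\mathcal{P}$ left to analyse is its first dimension
$[\mathcal{P}]_1 = ([p_t]_1)_{t\in\mathbb{N}}$. We start the study of $[\mathcal{P}]_1$
with the following lemma.

Lemma 12. The Markov chain $[\mathcal{P}]_1$ is $\psi$-irreducible, aperiodic, and
compacts of $\mathbb{R}$ are small-sets.

Proof. We have the following transition kernel:

$$P(p, A) = \int_{\mathbb{R}}1_A\left((1-c)p + \sqrt{c(2-c)}\,u\right)P(\mathcal{N}_{1:\lambda} = u)\,du$$

With a change of variables $\tilde{u} = (1-c)p + \sqrt{c(2-c)}\,u$, we get that

$$P(p, A) = \frac{1}{\sqrt{c(2-c)}}\int_{\mathbb{R}}1_A(\tilde{u})\,P\left(\mathcal{N}_{1:\lambda} = \frac{\tilde{u} - (1-c)p}{\sqrt{c(2-c)}}\right)d\tilde{u}$$

As $P(\mathcal{N}_{1:\lambda} = x) > 0$ for all $x \in \mathbb{R}$, for all $A$
non-$\mu_{Leb}$-negligible we have $P(p, A) > 0$, thus the chain $[\mathcal{P}]_1$ is
$\mu_{Leb}$-irreducible.

Furthermore, if we take $C$ a non-$\mu_{Leb}$-negligible compact of $\mathbb{R}$, and
$\nu_C$ a measure such that for $A$ a Borel set of $\mathbb{R}$,

$$\nu_C(A) = \frac{1}{\sqrt{c(2-c)}}\int_{\mathbb{R}}1_A(\tilde{u})\min_{p\in C}P\left(\mathcal{N}_{1:\lambda} = \frac{\tilde{u} - (1-c)p}{\sqrt{c(2-c)}}\right)d\tilde{u} ,$$

we see that $P(p, A) \ge \nu_C(A)$ for all $p \in C$, while $\nu_C$ is not a trivial
measure (indeed, $P(\mathcal{N}_{1:\lambda} = x) > k > 0$ for all $x \in C$). Therefore
compact sets of $\mathbb{R}$ are small sets for $[\mathcal{P}]_1$. Finally,
$\nu_C(C) > 0$, so the chain $[\mathcal{P}]_1$ is strongly aperiodic. □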

We use this new lemma with Lemma 5 to prove what is needed to apply the law of large
numbers on $[\mathcal{P}]_1$.

Lemma 13. The chain $[\mathcal{P}]_1$ is Harris recurrent positive with invariant measure
$\mu_{path}$, and the function $x \mapsto x^2$ is $\mu_{path}$-integrable.

Proof. We now have to get the right drift condition for the chain. Let
$V : x \mapsto x^2 + 1$.

$$\Delta V(x) = \int_{\mathbb{R}}V(y)P(x, dy) - V(x)$$
$$\Delta V(x) = \int_{\mathbb{R}}(y^2 + 1)\,P\left((1-c)x + \sqrt{c(2-c)}\,[\xi^\star]_1 \in dy\right) - (x^2 + 1)$$
$$\Delta V(x) = E\left(\left((1-c)x + \sqrt{c(2-c)}\,[\xi^\star]_1\right)^2 + 1\right) - x^2 - 1$$
$$\Delta V(x) \le \left((1-c)^2 - 1\right)x^2 + 2|x|\sqrt{c(2-c)}\,E\left(|[\xi^\star]_1|\right) + c(2-c)\,E\left([\xi^\star]_1^2\right)$$
$$\frac{\Delta V(x)}{V(x)} \le -c(2-c)\frac{x^2}{1+x^2} + \frac{2|x|\sqrt{c(2-c)}}{1+x^2}E\left(|[\xi^\star]_1|\right) + \frac{c(2-c)}{1+x^2}E\left([\xi^\star]_1^2\right)$$
$$\lim_{|x|\to\infty}\frac{\Delta V(x)}{V(x)} \le -c(2-c)$$

As $0 < c \le 1$, $c(2-c)$ is strictly positive and therefore, for $\epsilon > 0$ small
enough there exists $C = [-M, M]$ with $M > 0$ such that for all $x \in C^c$,
$\Delta V(x)/V(x) < -\epsilon$. If we take
$b = \epsilon V(M) + 2M\sqrt{c(2-c)}\,E(|[\xi^\star]_1|) + c(2-c)\,E([\xi^\star]_1^2)$, then
for all $x \in C$ we have $\Delta V(x) \le -\epsilon V(x) + b$. Hence the drift condition
$\Delta V(x) \le -\epsilon V(x) + b1_C$ is satisfied for all $x \in \mathbb{R}$.

According to Lemma 12 the chain $[\mathcal{P}]_1$ is $\psi$-irreducible and aperiodic, so
with Lemma 5 it is positive Harris recurrent, with invariant measure $\mu_{path}$, and $V$
is $\mu_{path}$-integrable. Therefore the function $x \mapsto x^2$ is also
$\mu_{path}$-integrable.

To obtain an equality between the rate we get through almost sure divergence, and the
rate in expectation, we need to define the $f$-norm, which for a signed measure $\nu$ and
a function $f \ge 1$ is equal to $\|\nu\|_f = \sup_{g : |g| \le f}|\nu(g)|$, and we need
the following lemma.

Lemma 14. [11, 14.3.5] Suppose $\Phi$ is an aperiodic positive Harris chain on a space
$X$ with stationary measure $\pi$, and that there exists some non-negative function $V$, a
function $f \ge 1$, a small-set $C$ and $b \in \mathbb{R}$ such that for all $x \in X$,
$\Delta V(x) \le -f(x) + b1_C(x)$. Then for all initial probability distribution $\nu$,
$\|\nu P^t - \pi\|_f \xrightarrow[t\to\infty]{} 0$.

We now obtain geometric divergence of the step-size and get an explicit estimate of 
the expression of the divergence rate. 

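Theorem 3. The step-size of the $(1,\lambda)$-CSA-ES with $\lambda \ge 2$ diverges
geometrically fast if $c < 1$ or $\lambda \ge 3$. Almost surely and in expectation we have
for $0 < c \le 1$,

$$\frac{1}{t}\ln\left(\frac{\sigma_t}{\sigma_0}\right) \xrightarrow[t\to\infty]{} \frac{1}{2d_\sigma n}\underbrace{\left(2(1-c)\,E\left(\mathcal{N}_{1:\lambda}\right)^2 + c\left(E\left(\mathcal{N}_{1:\lambda}^2\right) - 1\right)\right)}_{> 0 \text{ for } \lambda \ge 3 \text{ and for } \lambda = 2 \text{ and } c < 1} . \qquad (18)$$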



Proof. We will start by the convergence in expectation. From Eq. (17) we see that the 
parttodevelopisE([p t+1 ] 1 ).Byrecurrence [p t +i] 1 = (l-c)* +1 [p ]i + v /c(2 - c) ELoC 1 " 
c) 1 [^t_i] j ■ When £ goes to infinity, the influence of [p ] i in this equation goes to with 
(1 — c) t+1 , so we can remove it when taking the limit: 



i=0 



We will now develop the sum with the square, such that we have either a product 
[£t-i\ 1 1 w i m * 3< or [^t— This way, we can separate the variables by 

using Lemma 1 with the independence of over time. To do so, we use the develop- 
ment formula (E"=i a n) 2 = 2 Y17=i Y^j=i+i a i a j + E"=i a i ■ We take the limit of 
E( [Pt+i] ^) and find that it is equal to 

/ \ 



lim c(2— c) 

t— too 



2E £a-<o i+J ' E ([tUi +E( 1 - c ) 2lE ([«; 

i=0j = i+l s v ' i=0 v v 

V =E[££_ i ] 1 E[<; t %.] 1 =E[Ar 1:A P / 

(20) 

Now the expected value does not depend on $i$ or $j$, so what is left is to calculate $\sum_{i=0}^{t}\sum_{j=i+1}^{t}(1-c)^{i+j}$ and $\sum_{i=0}^{t}(1-c)^{2i}$. We have $\sum_{i=0}^{t}\sum_{j=i+1}^{t}(1-c)^{i+j} = \sum_{i=0}^{t}(1-c)^{2i+1}\frac{1-(1-c)^{t-i}}{1-(1-c)}$, and when we separate this sum in two, the part carrying $(1-c)^{t-i}$ goes to 0 for $t \to \infty$. Therefore, the double sum converges to $\lim_{t\to\infty}\sum_{i=0}^{t}(1-c)^{2i+1}/c$, which is equal to $\lim_{t\to\infty}\frac{1-c}{c}\sum_{i=0}^{t}(1-c)^{2i}$. And $\sum_{i=0}^{t}(1-c)^{2i}$ is equal to $(1-(1-c)^{2t+2})/(1-(1-c)^2)$, which converges to $1/(c(2-c))$. So, by inserting this in Eq. (20) we get that $E([p_{t+1}]_1^2) \to \frac{2-2c}{c}E(\mathcal{N}_{1:\lambda})^2 + E(\mathcal{N}_{1:\lambda}^2)$, which gives us the right hand side of Eq. (18).

By summing $E(\ln(\sigma_{i+1}/\sigma_i))$ for $i = 0, \ldots, t-1$ and dividing by $t$ we obtain the Cesàro mean $\frac{1}{t}E(\ln(\sigma_t/\sigma_0))$, which converges to the same value that $E(\ln(\sigma_{t+1}/\sigma_t))$ converges to when $t$ goes to infinity. Therefore Eq. (18) holds in expectation.



We now focus on the almost sure convergence. From Lemma 13, we see that we have the right conditions to apply Lemma 4 to the chain $[p]_1$ with the $\mu_{path}$-integrable function $g : x \mapsto x^2$. So $\frac{1}{t}\sum_{k=1}^{t}[p_k]_1^2 \to \mu_{path}(g)$ almost surely as $t \to \infty$. With Eq. (16) we obtain that

$$\frac{1}{t}\ln\left(\frac{\sigma_t}{\sigma_0}\right) \xrightarrow[t\to\infty]{a.s.} \frac{c}{2 d_\sigma n}\left(\mu_{path}(g) - 1\right).$$

We now prove that $\mu_{path}(g) = \lim_{t\to\infty}E([p_{t+1}]_1^2)$. Let $\nu$ be the initial distribution of $[p_0]_1$, so we have $|E([p_{t+1}]_1^2) - \mu_{path}(g)| \leq \|\nu P^{t+1} - \mu_{path}\|_h$, with $h : x \mapsto 1 + x^2$. From the proof of Lemma 13 and from Lemma 12 we have all conditions for Lemma 14. Therefore $\|\nu P^{t+1} - \mu_{path}\|_h \to 0$ as $t \to \infty$, which shows that

$$\mu_{path}(g) = \lim_{t\to\infty}E\left([p_{t+1}]_1^2\right) = \frac{2-2c}{c}E(\mathcal{N}_{1:\lambda})^2 + E(\mathcal{N}_{1:\lambda}^2).$$

According to Lemma 3, for $\lambda = 2$, $E(\mathcal{N}_{1:2}^2) = 1$, so the RHS of Eq. (18) is equal to $\frac{1-c}{d_\sigma n}E(\mathcal{N}_{1:2})^2$. The expected value of $\mathcal{N}_{1:2}$ is strictly negative, so the previous expression is strictly positive for $c < 1$. Furthermore, according to Lemma 3, $E(\mathcal{N}_{1:\lambda}^2)$ increases strictly with $\lambda$, as does $E(\mathcal{N}_{1:\lambda})^2$. Therefore we have geometric divergence for $\lambda \geq 2$ if $c < 1$, and for $\lambda \geq 3$.

□ 
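As a numerical illustration of Eq. (18), the following sketch evaluates the limit rate by integrating the density of $\mathcal{N}_{1:\lambda}$, the minimum of $\lambda$ independent standard normal variables. The function names, the integration grid and the value chosen for $d_\sigma$ are assumptions made for this example only.

```python
import math

def order_stat_moment(lam, k, lo=-10.0, hi=10.0, steps=20001):
    """k-th moment of N_{1:lam} (minimum of lam i.i.d. standard normals),
    computed with the trapezoidal rule on [lo, hi] (grid is an assumption)."""
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for s in range(steps):
        x = lo + s * h
        phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        # density of the minimum: lam * phi(x) * (1 - Phi(x))^(lam - 1)
        dens = lam * phi * (1.0 - Phi) ** (lam - 1)
        weight = 0.5 if s in (0, steps - 1) else 1.0
        total += weight * x ** k * dens
    return total * h

def divergence_rate(lam, c, n, d_sigma):
    """Right hand side of Eq. (18); d_sigma is a free parameter here."""
    m1 = order_stat_moment(lam, 1)
    m2 = order_stat_moment(lam, 2)
    return (2.0 * (1.0 - c) * m1 ** 2 + c * (m2 - 1.0)) / (2.0 * d_sigma * n)

if __name__ == "__main__":
    for lam, c in [(2, 1.0), (2, 0.5), (3, 1.0), (8, 0.5)]:
        print(lam, c, divergence_rate(lam, c, n=20, d_sigma=1.0))
```

For $\lambda = 2$ and $c = 1$ the computed rate vanishes up to integration error, in line with the condition "$c < 1$ or $\lambda \geq 3$" of Theorem 3.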

From Eq. (1) we see that the behaviour of the step-size and of $(X_t)_{t\in\mathbb{N}}$ are directly related. Geometric divergence of the step-size, as shown in Theorem 3, means that the movements in search space and the improvements on affine linear functions $f$ also increase geometrically fast. Analyzing $(X_t)_{t\in\mathbb{N}}$ with cumulation would require studying a double Markov chain, which is left to possible future research.
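The geometric divergence can also be observed in simulation. The sketch below is a minimal illustration under stated assumptions: the linear function is $f(x) = x_1$, so selection depends only on the first coordinate of each sampled step; the path follows the recurrence used in the proof of Theorem 3; the step-size update is taken as $\ln(\sigma_{t+1}/\sigma_t) = \frac{c}{2d_\sigma}(\|p_{t+1}\|^2/n - 1)$, the form appearing in Eq. (23); and $d_\sigma = 1$, $p_0 \sim \mathcal{N}(0, I)$ and the function name are choices made for this example.

```python
import numpy as np

def mean_log_step_size_change(n=20, lam=8, c=None, d_sigma=1.0,
                              iters=2000, seed=1):
    """Average of ln(sigma_{t+1}/sigma_t) over `iters` iterations of a
    (1, lam)-CSA-ES on f(x) = x_1 (all settings are illustrative)."""
    rng = np.random.default_rng(seed)
    if c is None:
        c = 1.0 / np.sqrt(n)
    p = rng.standard_normal(n)      # assumed initial path p_0 ~ N(0, I)
    log_sigma = 0.0                 # ln(sigma_t / sigma_0)
    for _ in range(iters):
        candidates = rng.standard_normal((lam, n))
        # minimising f(x) = x_1 selects the step with the smallest first coordinate
        selected = candidates[np.argmin(candidates[:, 0])]
        p = (1.0 - c) * p + np.sqrt(c * (2.0 - c)) * selected
        log_sigma += c / (2.0 * d_sigma) * (p @ p / n - 1.0)
    return log_sigma / iters

if __name__ == "__main__":
    print("lambda=8, c=1/sqrt(n):", mean_log_step_size_change())
    print("lambda=2, c=1        :", mean_log_step_size_change(lam=2, c=1.0))
```

Because the selected step on a linear function does not depend on $\sigma_t$ or $X_t$, the sketch tracks only the path and $\ln \sigma_t$.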

5 Study of the variations of $\ln(\sigma_{t+1}/\sigma_t)$

The proof of Theorem 3 shows that the step size increase converges to the right hand side of Eq. (18), for $t \to \infty$. When the dimension increases this increment goes to zero, which also suggests that it becomes more likely that $\sigma_{t+1}$ is smaller than $\sigma_t$. To analyze this behavior, we study the variance of $\ln(\sigma_{t+1}/\sigma_t)$ as a function of $c$ and the dimension.

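Theorem 4. The variance of $\ln(\sigma_{t+1}/\sigma_t)$ equals

$$\mathrm{Var}\left(\ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right)\right) = \frac{c^2}{4 d_\sigma^2 n^2}\left(E\left([p_{t+1}]_1^4\right) - E\left([p_{t+1}]_1^2\right)^2 + 2(n-1)\right). \quad (21)$$

Furthermore, $E([p_{t+1}]_1^2) \to E(\mathcal{N}_{1:\lambda}^2) + \frac{2-2c}{c}E(\mathcal{N}_{1:\lambda})^2$ for $t \to \infty$, and with $a = 1 - c$

$$\lim_{t\to\infty}E\left([p_{t+1}]_1^4\right) = \frac{(1-a^2)^2}{1-a^4}\left(k_4 + k_{31} + k_{22} + k_{211} + k_{1111}\right), \quad (22)$$

where $k_4 = E(\mathcal{N}_{1:\lambda}^4)$, $k_{31} = 4\frac{a(1+a+2a^2)}{1-a^3}E(\mathcal{N}_{1:\lambda}^3)E(\mathcal{N}_{1:\lambda})$, $k_{22} = 6\frac{a^2}{1-a^2}E(\mathcal{N}_{1:\lambda}^2)^2$, $k_{211} = 12\frac{a^3(1+2a+3a^2)}{(1-a^2)(1-a^3)}E(\mathcal{N}_{1:\lambda}^2)E(\mathcal{N}_{1:\lambda})^2$ and $k_{1111} = 24\frac{a^6}{(1-a)(1-a^2)(1-a^3)}E(\mathcal{N}_{1:\lambda})^4$.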
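The formulas of Theorem 4 can be evaluated numerically once the first four moments of $\mathcal{N}_{1:\lambda}$ are known. The sketch below is only an illustration: the moments are estimated by Monte Carlo sampling, and the function names, sample size and seed are assumptions of this example.

```python
import numpy as np

def second_moment_limit(c, m1, m2):
    """Limit of E([p_{t+1}]_1^2) stated in Theorem 4."""
    return (2.0 - 2.0 * c) / c * m1 ** 2 + m2

def fourth_moment_limit(c, m1, m2, m3, m4):
    """Limit of E([p_{t+1}]_1^4) from Eq. (22), with a = 1 - c and 0 < c <= 1."""
    a = 1.0 - c
    k4 = m4
    k31 = 4.0 * a * (1 + a + 2 * a ** 2) / (1 - a ** 3) * m3 * m1
    k22 = 6.0 * a ** 2 / (1 - a ** 2) * m2 ** 2
    k211 = (12.0 * a ** 3 * (1 + 2 * a + 3 * a ** 2)
            / ((1 - a ** 2) * (1 - a ** 3)) * m2 * m1 ** 2)
    k1111 = (24.0 * a ** 6
             / ((1 - a) * (1 - a ** 2) * (1 - a ** 3)) * m1 ** 4)
    return (1 - a ** 2) ** 2 / (1 - a ** 4) * (k4 + k31 + k22 + k211 + k1111)

def increment_variance(n, c, d_sigma, moments):
    """Stationary variance of ln(sigma_{t+1}/sigma_t) from Eq. (21)."""
    m1, m2, m3, m4 = moments
    e2 = second_moment_limit(c, m1, m2)
    e4 = fourth_moment_limit(c, m1, m2, m3, m4)
    return c ** 2 / (4.0 * d_sigma ** 2 * n ** 2) * (e4 - e2 ** 2 + 2 * (n - 1))

def sample_moments(lam, size=200_000, seed=0):
    """Monte Carlo estimates of E(N_{1:lam}^k) for k = 1..4 (assumed sampler)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((size, lam)).min(axis=1)
    return tuple(float(np.mean(z ** k)) for k in range(1, 5))

if __name__ == "__main__":
    moments = sample_moments(lam=8)
    for n in (2, 20, 200):
        print(n, increment_variance(n, c=1.0 / np.sqrt(n), d_sigma=1.0,
                                    moments=moments))
```

A simple consistency check: if every selected step were the constant 1 (all four moments equal to 1), the first path coordinate would converge to $\sqrt{1-a^2}/(1-a)$, and Eq. (22) indeed returns $((1+a)/(1-a))^2$ in that case.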



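Proof.

$$\mathrm{Var}\left(\ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right)\right) = \mathrm{Var}\left(\frac{c}{2 d_\sigma}\left(\frac{\|p_{t+1}\|^2}{n} - 1\right)\right) = \frac{c^2}{4 d_\sigma^2 n^2}\underbrace{\mathrm{Var}\left(\|p_{t+1}\|^2\right)}_{E(\|p_{t+1}\|^4) - E(\|p_{t+1}\|^2)^2} \quad (23)$$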

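The first part of $\mathrm{Var}(\|p_{t+1}\|^2)$, $E(\|p_{t+1}\|^4)$, is equal to $E((\sum_{i=1}^{n}[p_{t+1}]_i^2)^2)$. We develop it along the dimensions such that we can use the independence of $[p_{t+1}]_i$ and $[p_{t+1}]_j$ for $i \neq j$, to get $E(2\sum_{i=1}^{n}\sum_{j=i+1}^{n}[p_{t+1}]_i^2[p_{t+1}]_j^2 + \sum_{i=1}^{n}[p_{t+1}]_i^4)$. For $i \neq 1$, $[p_{t+1}]_i$ is distributed according to a standard normal distribution, so $E([p_{t+1}]_i^2) = 1$ and $E([p_{t+1}]_i^4) = 3$.

$$\begin{aligned}
E\left(\|p_{t+1}\|^4\right) &= 2\sum_{i=1}^{n}\sum_{j=i+1}^{n}E\left([p_{t+1}]_i^2\right)E\left([p_{t+1}]_j^2\right) + \sum_{i=1}^{n}E\left([p_{t+1}]_i^4\right) \\
&= \left(2\sum_{i=2}^{n}\sum_{j=i+1}^{n}1\right) + 2\sum_{j=2}^{n}E\left([p_{t+1}]_1^2\right) + \left(\sum_{i=2}^{n}3\right) + E\left([p_{t+1}]_1^4\right) \\
&= \left(2\sum_{i=2}^{n}(n-i)\right) + 2(n-1)E\left([p_{t+1}]_1^2\right) + 3(n-1) + E\left([p_{t+1}]_1^4\right) \\
&= E\left([p_{t+1}]_1^4\right) + 2(n-1)E\left([p_{t+1}]_1^2\right) + (n-1)(n+1)
\end{aligned}$$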



The other part left is $E(\|p_{t+1}\|^2)^2$, which we develop along the dimensions to get $E(\sum_{i=1}^{n}[p_{t+1}]_i^2)^2 = (E([p_{t+1}]_1^2) + (n-1))^2$, which equals $E([p_{t+1}]_1^2)^2 + 2(n-1)E([p_{t+1}]_1^2) + (n-1)^2$. So by subtracting both parts we get $E(\|p_{t+1}\|^4) - E(\|p_{t+1}\|^2)^2 = E([p_{t+1}]_1^4) - E([p_{t+1}]_1^2)^2 + 2(n-1)$, which we insert into Eq. (23) to get Eq. (21).

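The development of $E([p_{t+1}]_1^2)$ is the same as the one done in the proof of Theorem 3, that is, in the limit $E([p_{t+1}]_1^2) = \frac{2-2c}{c}E(\mathcal{N}_{1:\lambda})^2 + E(\mathcal{N}_{1:\lambda}^2)$. We now develop $E([p_{t+1}]_1^4)$. We have $E([p_{t+1}]_1^4) = E(((1-c)^{t+1}[p_0]_1 + \sqrt{c(2-c)}\sum_{i=0}^{t}(1-c)^i[\xi^\star_{t-i}]_1)^4)$. In the limit when $t$ goes to $\infty$ we neglect the part with $(1-c)^{t+1}[p_0]_1$, as it converges fast to 0. So

$$\lim_{t\to\infty}E\left([p_{t+1}]_1^4\right) = \lim_{t\to\infty}E\left(c^2(2-c)^2\left(\sum_{i=0}^{t}(1-c)^i[\xi^\star_{t-i}]_1\right)^4\right). \quad (24)$$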



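To develop the RHS of Eq. (24) we use the following formula: for $(a_i)_{i\in[[1,m]]}$,

$$\begin{aligned}
\left(\sum_{i=1}^{m}a_i\right)^4 ={}& \sum_{i=1}^{m}a_i^4 + 4\sum_{i=1}^{m}\sum_{\substack{j=1\\ j\neq i}}^{m}a_i^3a_j + 6\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i^2a_j^2 \\
&+ 12\sum_{i=1}^{m}\sum_{\substack{j=1\\ j\neq i}}^{m}\sum_{\substack{k=j+1\\ k\neq i}}^{m}a_i^2a_ja_k + 24\sum_{i=1}^{m}\sum_{j=i+1}^{m}\sum_{k=j+1}^{m}\sum_{l=k+1}^{m}a_ia_ja_ka_l. \quad (25)
\end{aligned}$$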
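The expansion in Eq. (25) is the multinomial formula for a fourth power, grouped by the pattern of repeated indices. As a quick sanity check, the following sketch (function name chosen for this example) evaluates the five grouped sums by brute force and compares their weighted total with the fourth power of the plain sum.

```python
from itertools import combinations

def fourth_power_expansion(a):
    """Right hand side of Eq. (25) for a finite sequence a."""
    idx = range(len(a))
    s4 = sum(x ** 4 for x in a)
    # ordered pairs (i, j) with j != i
    s31 = sum(a[i] ** 3 * a[j] for i in idx for j in idx if j != i)
    # unordered pairs i < j
    s22 = sum(a[i] ** 2 * a[j] ** 2 for i, j in combinations(idx, 2))
    # one squared index i and an unordered pair j < k, both different from i
    s211 = sum(a[i] ** 2 * a[j] * a[k]
               for i in idx
               for j, k in combinations(idx, 2)
               if i != j and i != k)
    # unordered quadruples i < j < k < l
    s1111 = sum(a[i] * a[j] * a[k] * a[l]
                for i, j, k, l in combinations(idx, 4))
    return s4 + 4 * s31 + 6 * s22 + 12 * s211 + 24 * s1111

if __name__ == "__main__":
    values = [1, 2, 3, 4, 5]
    print(fourth_power_expansion(values), sum(values) ** 4)
```

The weights 4, 6, 12 and 24 are the multinomial coefficients $4!/(3!\,1!)$, $4!/(2!\,2!)$, $4!/(2!\,1!\,1!)$ and $4!$.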



This formula allows us to use the independence over time of $[\xi^\star_t]_1$ from Lemma 1, so that $E([\xi^\star_i]_1^3[\xi^\star_j]_1) = E([\xi^\star_i]_1^3)E([\xi^\star_j]_1) = E(\mathcal{N}_{1:\lambda}^3)E(\mathcal{N}_{1:\lambda})$ for $i \neq j$, and so on. We apply Eq. (25) to Eq. (24), with $a = 1 - c$.



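$$\begin{aligned}
\lim_{t\to\infty}\frac{E\left([p_{t+1}]_1^4\right)}{c^2(2-c)^2} = \lim_{t\to\infty}{}& \sum_{i=0}^{t}a^{4i}E\left(\mathcal{N}_{1:\lambda}^4\right) + 4\sum_{i=0}^{t}\sum_{\substack{j=0\\ j\neq i}}^{t}a^{3i+j}E\left(\mathcal{N}_{1:\lambda}^3\right)E\left(\mathcal{N}_{1:\lambda}\right) \\
&+ 6\sum_{i=0}^{t}\sum_{j=i+1}^{t}a^{2i+2j}E\left(\mathcal{N}_{1:\lambda}^2\right)^2 \\
&+ 12\sum_{i=0}^{t}\sum_{\substack{j=0\\ j\neq i}}^{t}\sum_{\substack{k=j+1\\ k\neq i}}^{t}a^{2i+j+k}E\left(\mathcal{N}_{1:\lambda}^2\right)E\left(\mathcal{N}_{1:\lambda}\right)^2 \\
&+ 24\sum_{i=0}^{t}\sum_{j=i+1}^{t}\sum_{k=j+1}^{t}\sum_{l=k+1}^{t}a^{i+j+k+l}E\left(\mathcal{N}_{1:\lambda}\right)^4 \quad (26)
\end{aligned}$$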

We now have to develop each term of Eq. (26).

$$\sum_{i=0}^{t}a^{4i} = \frac{1-a^{4(t+1)}}{1-a^4}, \qquad \lim_{t\to\infty}\sum_{i=0}^{t}a^{4i} = \frac{1}{1-a^4} \quad (27)$$



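$$\sum_{i=0}^{t}\sum_{\substack{j=0\\ j\neq i}}^{t}a^{3i+j} = \sum_{i=0}^{t-1}\sum_{j=i+1}^{t}a^{3i+j} + \sum_{i=1}^{t}\sum_{j=0}^{i-1}a^{3i+j} \quad (28)$$

$$\begin{aligned}
\sum_{i=0}^{t-1}\sum_{j=i+1}^{t}a^{3i+j} &= \sum_{i=0}^{t-1}a^{4i+1}\frac{1-a^{t-i}}{1-a} \\
\lim_{t\to\infty}\sum_{i=0}^{t-1}\sum_{j=i+1}^{t}a^{3i+j} &= \lim_{t\to\infty}\frac{a}{1-a}\sum_{i=0}^{t-1}a^{4i} = \frac{a}{(1-a)(1-a^4)} \quad (29)
\end{aligned}$$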



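$$\begin{aligned}
\sum_{i=1}^{t}\sum_{j=0}^{i-1}a^{3i+j} &= \sum_{i=1}^{t}a^{3i}\frac{1-a^{i}}{1-a} = \frac{1}{1-a}\left(a^3\frac{1-a^{3t}}{1-a^3} - a^4\frac{1-a^{4t}}{1-a^4}\right) \\
\lim_{t\to\infty}\sum_{i=1}^{t}\sum_{j=0}^{i-1}a^{3i+j} &= \frac{1}{1-a}\left(\frac{a^3}{1-a^3} - \frac{a^4}{1-a^4}\right) \\
&= \frac{a^3(1-a^4) - a^4(1-a^3)}{(1-a)(1-a^3)(1-a^4)} \\
&= \frac{a^3 - a^4}{(1-a)(1-a^3)(1-a^4)} \quad (30)
\end{aligned}$$

By combining Eq. (29) and Eq. (30) in Eq. (28) we get

$$\begin{aligned}
\lim_{t\to\infty}\sum_{i=0}^{t}\sum_{\substack{j=0\\ j\neq i}}^{t}a^{3i+j} &= \frac{a(1-a^3) + a^3 - a^4}{(1-a)(1-a^3)(1-a^4)} = \frac{a(1+a^2-2a^3)}{(1-a)(1-a^3)(1-a^4)} \\
&= \frac{a(1-a)(1+a+2a^2)}{(1-a)(1-a^3)(1-a^4)} = \frac{a(1+a+2a^2)}{(1-a^3)(1-a^4)} \quad (31)
\end{aligned}$$

$$\begin{aligned}
\sum_{i=0}^{t-1}\sum_{j=i+1}^{t}a^{2i+2j} &= \sum_{i=0}^{t-1}a^{4i+2}\frac{1-a^{2(t-i)}}{1-a^2} \\
\lim_{t\to\infty}\sum_{i=0}^{t-1}\sum_{j=i+1}^{t}a^{2i+2j} &= \lim_{t\to\infty}\frac{a^2}{1-a^2}\sum_{i=0}^{t-1}a^{4i} = \frac{a^2}{(1-a^2)(1-a^4)} \quad (32)
\end{aligned}$$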

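$$\begin{aligned}
\sum_{i=0}^{t}\sum_{\substack{j=0\\ j\neq i}}^{t-1}\sum_{\substack{k=j+1\\ k\neq i}}^{t}a^{2i+j+k} ={}& \sum_{i=2}^{t}\sum_{j=0}^{i-2}\sum_{k=j+1}^{i-1}a^{2i+j+k} + \sum_{i=1}^{t-1}\sum_{j=0}^{i-1}\sum_{k=i+1}^{t}a^{2i+j+k} \\
&+ \sum_{i=0}^{t-2}\sum_{j=i+1}^{t-1}\sum_{k=j+1}^{t}a^{2i+j+k} \quad (33)
\end{aligned}$$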



$$\begin{aligned}
\sum_{i=2}^{t}\sum_{j=0}^{i-2}\sum_{k=j+1}^{i-1}a^{2i+j+k}
&= \sum_{i=2}^{t}\sum_{j=0}^{i-2}a^{2i+2j+1}\frac{1-a^{i-j-1}}{1-a} \\
&= \frac{1}{1-a}\sum_{i=2}^{t}\left(a^{2i+1}\frac{1-a^{2(i-1)}}{1-a^2} - a^{3i}\frac{1-a^{i-1}}{1-a}\right) \\
&= \frac{1}{1-a}\left(\frac{a^5}{1-a^2}\frac{1-a^{2(t-1)}}{1-a^2} - \frac{a^7}{1-a^2}\frac{1-a^{4(t-1)}}{1-a^4} - \frac{a^6}{1-a}\frac{1-a^{3(t-1)}}{1-a^3} + \frac{a^7}{1-a}\frac{1-a^{4(t-1)}}{1-a^4}\right) \\
&\underset{t\to\infty}{\longrightarrow} \frac{a^5}{1-a}\left(\frac{1}{(1-a^2)^2} - \frac{a^2}{(1-a^2)(1-a^4)} - \frac{a}{(1-a)(1-a^3)} + \frac{a^2(1+a)}{(1+a)(1-a)(1-a^4)}\right) \\
&= \frac{a^5}{1-a}\left(\frac{1+a^2+a^3}{(1+a)(1-a)(1-a^4)} - \frac{a}{(1-a)(1-a^3)}\right) \\
&= a^5\,\frac{(1+a^2+a^3)(1-a^3) - a(1+a)(1-a^4)}{(1-a)^2(1+a)(1-a^3)(1-a^4)} \\
&= a^5\,\frac{1+a^2-a^5-a^6 - (a+a^2-a^5-a^6)}{(1-a)(1-a^2)(1-a^3)(1-a^4)} \\
&= \frac{a^5}{(1-a^2)(1-a^3)(1-a^4)} \quad (34)
\end{aligned}$$



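$$\begin{aligned}
\sum_{i=1}^{t-1}\sum_{j=0}^{i-1}\sum_{k=i+1}^{t}a^{2i+j+k}
&= \sum_{i=1}^{t-1}\sum_{j=0}^{i-1}a^{3i+j+1}\frac{1-a^{t-i}}{1-a} \\
&\underset{t\to\infty}{\longrightarrow} \lim_{t\to\infty}\frac{a}{1-a}\sum_{i=1}^{t-1}a^{3i}\frac{1-a^{i}}{1-a} \\
&= \lim_{t\to\infty}\frac{a}{(1-a)^2}\left(a^3\frac{1-a^{3(t-1)}}{1-a^3} - a^4\frac{1-a^{4(t-1)}}{1-a^4}\right) \\
&= \frac{a}{(1-a)^2}\,\frac{a^3(1-a^4) - a^4(1-a^3)}{(1-a^3)(1-a^4)} \\
&= \frac{a^4(1-a)}{(1-a)^2(1-a^3)(1-a^4)} = \frac{a^4}{(1-a)(1-a^3)(1-a^4)} \quad (35)
\end{aligned}$$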



$$\begin{aligned}
\sum_{i=0}^{t-2}\sum_{j=i+1}^{t-1}\sum_{k=j+1}^{t}a^{2i+j+k}
&= \sum_{i=0}^{t-2}\sum_{j=i+1}^{t-1}a^{2i+2j+1}\frac{1-a^{t-j}}{1-a} \\
&\underset{t\to\infty}{\longrightarrow} \lim_{t\to\infty}\frac{a}{1-a}\sum_{i=0}^{t-2}a^{4i+2}\frac{1-a^{2(t-i-1)}}{1-a^2} \\
&= \lim_{t\to\infty}\frac{a^3}{(1-a)(1-a^2)}\,\frac{1-a^{4(t-1)}}{1-a^4} \\
&= \frac{a^3}{(1-a)(1-a^2)(1-a^4)} \quad (36)
\end{aligned}$$

We now combine Eq. (34), Eq. (35) and Eq. (36) in Eq. (33).

$$\begin{aligned}
\sum_{i=0}^{t}\sum_{\substack{j=0\\ j\neq i}}^{t-1}\sum_{\substack{k=j+1\\ k\neq i}}^{t}a^{2i+j+k}
&\underset{t\to\infty}{\longrightarrow} \frac{a^5(1-a) + a^4(1-a^2) + a^3(1-a^3)}{(1-a)(1-a^2)(1-a^3)(1-a^4)} \\
&= \frac{a^3 + a^4 + a^5 - 3a^6}{(1-a)(1-a^2)(1-a^3)(1-a^4)} \\
&= \frac{a^3(1+2a+3a^2)}{(1-a^2)(1-a^3)(1-a^4)} \quad (37)
\end{aligned}$$



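$$\begin{aligned}
\sum_{i=0}^{t-3}\sum_{j=i+1}^{t-2}\sum_{k=j+1}^{t-1}\sum_{l=k+1}^{t}a^{i+j+k+l}
&= \sum_{i=0}^{t-3}\sum_{j=i+1}^{t-2}\sum_{k=j+1}^{t-1}a^{i+j+2k+1}\frac{1-a^{t-k}}{1-a} \\
&\underset{t\to\infty}{\longrightarrow} \lim_{t\to\infty}\frac{a}{1-a}\sum_{i=0}^{t-3}\sum_{j=i+1}^{t-2}a^{i+3j+2}\frac{1-a^{2(t-j-1)}}{1-a^2} \\
&= \lim_{t\to\infty}\frac{a^3}{(1-a)(1-a^2)}\sum_{i=0}^{t-3}a^{4i+3}\frac{1-a^{3(t-2-i)}}{1-a^3} \\
&= \lim_{t\to\infty}\frac{a^6}{(1-a)(1-a^2)(1-a^3)}\,\frac{1-a^{4(t-2)}}{1-a^4} \\
&= \frac{a^6}{(1-a)(1-a^2)(1-a^3)(1-a^4)} \quad (38)
\end{aligned}$$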

By factorising Eq. (27), Eq. (31), Eq. (32), Eq. (37) and Eq. (38) by $1/(1-a^4)$ we get the coefficients of Theorem 4. □
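The five limits in Eq. (27), (31), (32), (37) and (38) can be checked numerically by comparing truncated sums with the closed forms. The sketch below is an illustration only; the function names, the value $a = 0.5$ and the truncation index $t = 40$ are choices made for this example.

```python
from itertools import combinations

def truncated_sums(a, t):
    """Finite sums over indices 0..t matching the five terms of Eq. (26)."""
    idx = range(t + 1)
    s4 = sum(a ** (4 * i) for i in idx)
    s31 = sum(a ** (3 * i + j) for i in idx for j in idx if j != i)
    s22 = sum(a ** (2 * i + 2 * j) for i, j in combinations(idx, 2))
    s211 = sum(a ** (2 * i + j + k)
               for i in idx
               for j, k in combinations(idx, 2)
               if i != j and i != k)
    s1111 = sum(a ** (i + j + k + l)
                for i, j, k, l in combinations(idx, 4))
    return s4, s31, s22, s211, s1111

def closed_forms(a):
    """Limits from Eq. (27), (31), (32), (37) and (38)."""
    d1, d2, d3, d4 = 1 - a, 1 - a ** 2, 1 - a ** 3, 1 - a ** 4
    return (
        1 / d4,
        a * (1 + a + 2 * a ** 2) / (d3 * d4),
        a ** 2 / (d2 * d4),
        a ** 3 * (1 + 2 * a + 3 * a ** 2) / (d2 * d3 * d4),
        a ** 6 / (d1 * d2 * d3 * d4),
    )

if __name__ == "__main__":
    for finite, limit in zip(truncated_sums(0.5, 40), closed_forms(0.5)):
        print(finite, limit)
```

All five closed forms share the factor $1/(1-a^4)$, which is the factor pulled out in front of the $k$ coefficients in Eq. (22).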

Figure 1 shows the time evolution of $\ln(\sigma_t/\sigma_0)$ for 5001 runs and $c = 1$ (left) and $c = 1/\sqrt{n}$ (right). By comparing Figure 1a and Figure 1b we observe smaller variations of $\ln(\sigma_t/\sigma_0)$ with the smaller value of $c$.

[Figure 1, two panels: (a) Without cumulation ($c = 1$); (b) With cumulation ($c = 1/\sqrt{20}$). Horizontal axis: number of iterations.]

Fig. 1: $\ln(\sigma_t/\sigma_0)$ against $t$. The different curves represent the quantiles of a set of $5 \cdot 10^3 + 1$ samples, more precisely the $10^{-i}$-quantile and the $(1 - 10^{-i})$-quantile for $i$ from 1 to 4; and the median. We have $n = 20$ and $\lambda = 8$.

Figure 2 shows the relative standard deviation of $\ln(\sigma_{t+1}/\sigma_t)$ (i.e. the standard deviation divided by its expected value). Lowering $c$, as shown on the left, decreases the relative standard deviation. To get a value below one, $c$ must be smaller for larger dimension. In agreement with Theorem 4, in Figure 2, right, the relative standard deviation increases like $\sqrt{n}$ with the dimension for constant $c$ (three increasing curves). A careful study [ ] of the variance equation of Theorem 4 shows that for the choice of $c = 1/(1 + n^\alpha)$, if $\alpha > 1/3$ the relative standard deviation converges to 0 with $\sqrt{(n^{2\alpha} + n)/n^{3\alpha}}$. Taking $\alpha = 1/3$ is a critical value where the relative standard deviation converges to $1/(\sqrt{2}\,E(\mathcal{N}_{1:\lambda})^2)$. On the other hand, lower values of $\alpha$ make the relative standard deviation diverge with $n^{(1-3\alpha)/2}$.
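The three regimes of the scaling factor $\sqrt{(n^{2\alpha} + n)/n^{3\alpha}}$ can be made concrete with a few values. The sketch below only evaluates this factor, not the full relative standard deviation; the function name and the chosen dimensions are assumptions of the example.

```python
import math

def scaling_factor(n, alpha):
    """sqrt((n^(2 alpha) + n) / n^(3 alpha)), the asymptotic scaling in n."""
    return math.sqrt((n ** (2 * alpha) + n) / n ** (3 * alpha))

if __name__ == "__main__":
    for alpha in (0.25, 1.0 / 3.0, 0.5):
        values = [scaling_factor(n, alpha) for n in (1e2, 1e4, 1e6)]
        print(alpha, values)
```

For $\alpha = 1/2$ the factor behaves like $\sqrt{2}\,n^{-1/4}$ and decreases, for $\alpha = 1/3$ it equals $\sqrt{1 + n^{-1/3}}$ and tends to 1, and for $\alpha = 1/4$ it grows like $n^{1/8} = n^{(1-3\alpha)/2}$.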



6 Summary 



We investigate throughout this paper the $(1, \lambda)$-CSA-ES on affine linear functions composed with strictly increasing transformations. We find, in Theorem 3, the limit distribution for $\ln(\sigma_t/\sigma_0)/t$ and rigorously prove the desired behaviour of $\sigma$ with $\lambda \geq 3$ for any $c$, and with $\lambda = 2$ and cumulation ($0 < c < 1$): the step-size diverges geometrically fast. In contrast, without cumulation ($c = 1$) and with $\lambda = 2$, a random walk on $\ln(\sigma)$ occurs, like for the $(1, 2)$-$\sigma$SA-ES [ ] (and also for the same symmetry reason). We derive an expression for the variance of the step-size increment. On linear functions when $c = 1/n^\alpha$, for $\alpha \geq 0$ ($\alpha = 0$ meaning $c$ constant) and for $n \to \infty$, the standard deviation is about $\sqrt{(n^{2\alpha} + n)/n^{3\alpha}}$ times larger than the step-size increment. From this it follows that keeping $c < 1/n^{1/3}$ ensures that the standard deviation of $\ln(\sigma_{t+1}/\sigma_t)$ becomes negligible compared to the expected value of $\ln(\sigma_{t+1}/\sigma_t)$ when the dimension goes to infinity. That means the noise-to-signal ratio goes to zero, giving the algorithm strong stability. The result confirms that even the largest default cumulation parameter $c = 1/\sqrt{n}$ is a stable choice.




[Figure 2, two panels. Horizontal axis of the right panel: dimension of the search space.]

Fig. 2: Standard deviation of $\ln(\sigma_{t+1}/\sigma_t)$ relative to its expectation. Here $\lambda = 8$. The curves were plotted using Eq. (21) and Eq. (22). On the left, curves for (right to left) $n = 2$, 20, 200 and 2000. On the right, different curves for (top to bottom) $c = 1$, 0.5, 0.2, $1/(1 + n^{1/4})$, $1/(1 + n^{1/3})$, $1/(1 + n^{1/2})$ and $1/(1 + n)$.
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